ABSTRACT

Real-life graphs usually have various kinds of events hap- 
pening on them, e.g., product purchases in online social net- 
works and intrusion alerts in computer networks. The oc- 
currences of events on the same graph could be correlated, 
exhibiting either attraction or repulsion. Such structural 
correlations can reveal important relationships between dif- 
ferent events. Unfortunately, correlation relationships on 
graph structures are not well studied and cannot be cap- 
tured by traditional measures. 

In this work, we design a novel measure for assessing two-event
structural correlations on graphs. Given the occurrences of
two events, we uniformly choose a sample of "reference nodes"
from the vicinity of all event nodes and employ Kendall's $\tau$
rank correlation measure to compute the average concordance
of event density changes. Significance can be assessed
efficiently because $\tau$ is asymptotically normal under the
null hypothesis. In
order to compute the measure in large scale networks, we de- 
velop a scalable framework using different sampling strate- 
gies. The complexity of these strategies is analyzed. Exper- 
iments on real graph datasets with both synthetic and real 
events demonstrate that the proposed framework is not only 
efficacious, but also efficient and scalable. 

1. INTRODUCTION 

In recent years, an increasing number of real-life networks 
have emerged and experienced a substantial growth, e.g. on- 
line social networks, WWW and Internet. A lot of works 
have been dedicated to research problems related to net- 
work structures, e.g. graph pattern mining [17, 25] and link 
analysis [6]. One important aspect of complex networks is 
that their nodes usually produce various kinds of data, which 
we abstract as events in this work. For instance, an eBay
customer could sell or bid on a product; a computer on the Internet
could suffer various attacks from hackers. This gives rise to
the branch of research involving graph structures and events 
happening on graphs [8, 20, 16, 11]. 

Figure 1: Two types of Two-Event Structural Correlation: (a) attraction and (b) repulsion.

Two events occurring on the same graph could be cor- 
related. Two illustrative examples are shown in Figure 1. 
In Figure 1(a), A and B exhibit a positive correlation (at- 
traction). In the context of a social network, they could 
be two baby formula brands, Similac and Enfamil. Their 
distributions could imply that there exist "mother commu- 
nities" in the social network where different mothers would 
prefer different baby formula brands. The two brands at- 
tract each other because of the communities. An example 
of negative correlation (repulsion) could be that people in an 
Apple fans' community would probably not buy products of 
ThinkPad and vice versa, as conveyed by Figure 1(b). We
name this kind of structural correlation as Two-Event Struc- 
tural Correlation (TESC). TESC is different from correlation 
in transactions such as market baskets. If we treat nodes of 
a graph as transactions and assess Transaction Correlation 
(TC) of two events by using measures such as Lift [12], one 
can verify that in Figure 1(a), A and B have a negative 
TC, although they exhibit a positive TESC. Regarding the 
baby formula example, a mother would probably stick to 
one brand, since switching between different brands could 
lead to baby diarrhea. As another example, in terms of 
computer networks, A and B could be two related intrusion 
techniques used by hackers to attack target subnets. Since 
attacks consume bandwidth, there is a tradeoff between the 
number of hosts attacked and the number of techniques ap- 
plied to one host. Hackers might choose to maximize cov- 
erage by alternating related intrusion techniques for hosts 
in a subnet, in order to increase the chance of success. We 
will show such examples in the experiments. Hence, TESC is 
useful for detecting structural correlations which might not 
be detected by TC. It can be used to improve applications 
such as online advertisement [4] and recommendation [14]. 
For instance, most recommendation methods exploit positive
TC, while positive TESC provides an alternative recommendation
scheme in local neighborhoods. TESC can reveal
important relationships between events (the intrusion ex- 
ample) or reflect structural characteristics (communities in 
the product examples) on a graph. Nevertheless, this paper 
focuses on measuring TESC, but not finding its cause. 

Unfortunately, measuring TESC is not a trivial problem. 
A similar problem also exists in correlation test between two 
point sets in spatial data [7, 18, 23]. However, discrete graph 
space is intrinsically different from continuous spatial space 
and the existing techniques for the point pattern problem 
cannot be applied directly. More importantly, in the graph 
context, scalability is a major issue since real-life networks 
often contain millions or even billions of nodes and edges, 
while the point pattern problem only considers datasets of 
very limited sizes (e.g. several hundreds of points). 

Researchers have studied the distributions of events in 
graph spaces. Kahn et al. studied proximity pattern min- 
ing where the goal was to mine groups of events which fre- 
quently co-occurred in local neighborhoods in a graph [16]. 
However, (1) they only consider positive correlations (asso- 
ciation) among events, while TESC aims to measure both 
positive and negative correlations; (2) they rely on an em- 
pirical method for significance testing, while our method is 
rigorous statistical testing based on Kendall's $\tau$ statistic;
(3) their problem is intrinsically a frequent pattern mining 
problem and could miss some rare but positively correlated 
event pairs, as will be shown in the experiments. In [11], 
we proposed a measure based on hitting time to assess the 
structural correlation within an event. That measure is not 
suitable for TESC in that if we adapt the measure to com- 
pute the affinity between two events, its distribution in the 
null case is difficult to estimate by simulations. It is hard 
to preserve each event's internal structure when simulating 
independence between them. 

In this paper, we propose a novel measure and then an 
efficient framework for computing TESC on graphs. Specifi- 
cally, given the occurrences of two events, we choose a sam- 
ple of "reference nodes" uniformly from the vicinity of all 
occurrences and compute for each reference node the den- 
sities of the two events in its vicinity, respectively. Then 
we employ Kendall's $\tau$ rank correlation measure [15] to
compute the average concordance of density changes for the
two events, over all pairs of reference nodes. Finally, correlation
significance can be assessed efficiently because $\tau$ is
asymptotically normal under the null hypothesis. To sample
reference nodes efficiently, different
sampling techniques are proposed to shorten the statistical 
testing time. Our framework is scalable to very large graphs. 

Our Contributions First, we introduce a new structural correlation
problem: Two-Event Structural Correlation (TESC),
which measures whether and to what degree two events occurring
on the same graph are correlated. TESC is a fundamental
problem that helps understand how different events
are related with one another on a particular graph and can 
be insightful for many attributed graph mining applications. 
Second, we develop an efficient statistical testing framework 
for measuring TESC. The main idea is to compute the con- 
cordance of each pair of reference nodes with regard to the 
respective density changes of the two events, from one refer- 
ence node's vicinity to the other's. High concordance scores 
mean that the occurrence of one event tends to attract the
occurrence of the other event (positive correlation), while
low concordance scores mean the two events repulse each 
other (negative correlation). An important subproblem is 
how to efficiently sample reference nodes. To this end, we 
propose three different algorithms for efficient reference node 
selection and analyze their advantages and disadvantages 
both theoretically and empirically. Finally, we demonstrate 
the efficacy of the TESC testing framework by event simu- 
lations on the DBLP graph. We further test its scalability 
in a Twitter graph with 20 million nodes. Case studies of 
applying the testing framework on real events occurring on 
real graphs are provided with interesting results. 

2. PRELIMINARIES 

We consider an attributed graph $G = (V, E)$ where an
event set $Q$ contains all events that occur on $V$. Each node
$v$ possesses a set of events $Q_v \subseteq Q$ which have occurred on
it. For an event $a \in Q$, we denote the set of nodes having
$a$ as $V_a$. In this paper, we use $a$ and $b$ to denote the two
events for which we want to assess the structural correlation. 
For the sake of simplicity, we assume G is undirected and 
unweighted. Nevertheless, the proposed approach could be 
extended for graphs with directed and/or weighted edges. 

Problem Statement Given two events $a$ and $b$ and their
corresponding occurrences $V_a$ and $V_b$, determine whether
$a$ and $b$ are correlated (and, if so, whether positively or
negatively) in the graph space with respect to a vicinity level $h$.

We formally define the notion of vicinity on a graph as 
follows. 

Definition 1 (Node Level-h Vicinity). Given a graph
$G = (V, E)$ and a node $u \in V$, the level-$h$ vicinity (or
$h$-vicinity) of $u$ is defined as the subgraph induced by the set
of nodes whose distances from $u$ are less than or equal to $h$.
We use $V_u^h$ and $E_u^h$ to denote the sets of nodes and edges in
$u$'s $h$-vicinity, respectively.

Definition 2 (Node Set h-Vicinity). Given a graph
$G = (V, E)$ and a node set $V' \subseteq V$, the $h$-vicinity of $V'$ is
defined as the subgraph induced by the set of nodes which are
within distance $h$ from at least one node $u \in V'$. For event
$a$, we use $V_a^h$ and $E_a^h$ to denote the sets of nodes and edges
in $V_a$'s $h$-vicinity, respectively.

Let $V_{a \cup b} = V_a \cup V_b$ denote the set of nodes having at
least one of events $a$ and $b$, i.e. all event nodes. The sets of
nodes and edges in the $h$-vicinity of $V_{a \cup b}$ are denoted by
$V_{a \cup b}^h$ and $E_{a \cup b}^h$, respectively. To assess the structural
correlation between $a$ and $b$, we employ a set of reference nodes.

Definition 3 (Reference Nodes). Given two events
$a$ and $b$ on $G$, a node $r \in V$ is a reference node for assessing
level-$h$ TESC between $a$ and $b$ iff $r \in V_{a \cup b}^h$.

Definition 3 indicates that we treat $V_{a \cup b}^h$ as the set of all
reference nodes for assessing level-$h$ TESC between $a$ and
$b$. The reason will be explained in Section 3.2. We define
the notion of concordance for a pair of reference nodes as
follows.

Definition 4 (Concordance). Two reference nodes
$r_i$ and $r_j$ for assessing level-$h$ TESC between $a$ and $b$ are
said to be concordant iff both $a$'s density and $b$'s density
increase (or decrease) when we move from $r_i$'s $h$-vicinity to
$r_j$'s $h$-vicinity.
Mathematically, the concordance function $c(r_i, r_j)$ is defined as

\[
c(r_i, r_j) =
\begin{cases}
1 & \text{if } \big(s_a^h(r_i) - s_a^h(r_j)\big)\big(s_b^h(r_i) - s_b^h(r_j)\big) > 0, \\
-1 & \text{if } \big(s_a^h(r_i) - s_a^h(r_j)\big)\big(s_b^h(r_i) - s_b^h(r_j)\big) < 0, \\
0 & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $s_a^h(r_i)$ is the density of event $a$ in $r_i$'s $h$-vicinity:

\[
s_a^h(r_i) = \frac{|V_a \cap V_{r_i}^h|}{|V_{r_i}^h|}. \tag{2}
\]

$c(r_i, r_j)$ encodes concordance as 1 and discordance as -1.
A value of 0 means $r_i$ and $r_j$ are in a tie, i.e. $s_a^h(r_i) = s_a^h(r_j)$ or
$s_b^h(r_i) = s_b^h(r_j)$, which means the pair indicates neither
concordance nor discordance. Regarding $s_a^h(r_i)$, the reason that
we use $|V_{r_i}^h|$ to normalize the occurrence number is that
different nodes could have $h$-vicinities of quite different sizes.
$|V_{r_i}^h|$ can be regarded as an analogue of area in spatial spaces.
Normalization makes all reference nodes' $h$-vicinities have
the same "area". The computation of $s_a^h(r_i)$ is simple: we
do a Breadth-First Search (BFS) up to $h$ hops (hereafter
called $h$-hop BFS) from $r_i$ to count the number of
occurrences of the event. More sophisticated graph proximity
measures could be used here, such as hitting time [19]
and personalized PageRank [6]. However, the major issue
with these sophisticated measures is the high computational
cost. As will be demonstrated in experiments, our density
measure is not only much more efficient but also effective.
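
As a rough illustration of the definitions above, the following Python sketch computes the density of Eq. (2) with an h-hop BFS and the concordance score of Eq. (1). The function names (`h_vicinity`, `density`, `concordance`), the adjacency-dict graph representation, and the toy path graph are assumptions made for this example only; they do not come from the paper.

```python
from collections import deque


def h_vicinity(adj, source, h):
    """Return the node set of `source`'s h-vicinity via an h-hop BFS."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if depth[v] == h:  # nodes at depth h are kept but not expanded
            continue
        for u in adj[v]:
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return set(depth)


def density(adj, event_nodes, r, h):
    """Eq. (2): fraction of nodes in r's h-vicinity that carry the event."""
    vicinity = h_vicinity(adj, r, h)
    return len(vicinity & event_nodes) / len(vicinity)


def concordance(sa_i, sa_j, sb_i, sb_j):
    """Eq. (1): +1 if concordant, -1 if discordant, 0 for a tie."""
    prod = (sa_i - sa_j) * (sb_i - sb_j)
    return (prod > 0) - (prod < 0)


# Toy path graph 0-1-2-3-4 (hypothetical data, for illustration only).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
V_a, V_b = {0, 1}, {1}
h = 1
# Nodes 0 and 2 both lie in the 1-vicinity of the event nodes.
s_a = {r: density(adj, V_a, r, h) for r in (0, 2)}
s_b = {r: density(adj, V_b, r, h) for r in (0, 2)}
print(concordance(s_a[0], s_a[2], s_b[0], s_b[2]))  # prints 1
```

Here both densities drop when moving from node 0 to node 2, so the pair is concordant. Densities are ratios of integers, so equal densities compare equal and ties are detected exactly in this sketch.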

3. MEASURING TESC 

This section presents our TESC testing framework. First, 
we show the intuition behind using reference nodes to as- 
sess TESC. If events a and b are positively correlated on G, 
a region where a appears tends to also contain occurrences 
of b, and vice versa. Furthermore, more occurrences of one
event will tend to imply more occurrences of the other one. 
On the contrary, when a and b are negatively correlated, 
the presence of one event is likely to imply the absence of 
the other one. Even if they appear together, an increase of 
occurrences of one event is likely to imply a decrease of the 
other. Figure 2 shows the four typical scenarios described
above. $r_1$ and $r_2$ are two reference nodes. Here let us assume
the $h$-vicinities (denoted by dotted circles) of $r_1$ and $r_2$ have the
same number of nodes so that we can treat the number of
occurrences as density. We can see in Figure 2(a) and 2(b) that,
when $a$ and $b$ attract each other, $r_1$ and $r_2$ are concordant,
providing evidence of positive correlation. In the repulsion
cases (Figure 2(c) and 2(d)), $r_1$ and $r_2$ are discordant,
providing evidence of negative correlation. Therefore, the
idea is to aggregate this evidence over all pairs of reference
nodes to assess TESC.

The natural choice for computing the overall concordance
among reference nodes with regard to density changes of the
two events is Kendall's $\tau$ rank correlation [15], which
was also successfully applied to the spatial point pattern
correlation problem [7, 23]. For clarity, let $N = |V_{a \cup b}^h|$. We
have $N$ reference nodes: $r_1, r_2, \ldots, r_N$. Kendall's $\tau$
measure is defined as an aggregation of the $c(r_i, r_j)$'s:

\[
\tau(a, b) = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j)}{\frac{1}{2} N(N-1)}. \tag{3}
\]

$\tau(a, b)$ lies in $[-1, 1]$. A higher positive value of $\tau(a, b)$ means
a stronger positive correlation, while a lower negative value
means a stronger negative correlation. $\tau(a, b) = 0$ means
there is no correlation between $a$ and $b$, i.e. the amount of
evidence for positive correlation is equal to that for negative
correlation.

Figure 2: Four illustrative examples showing that
density changes of the two events between two reference
nodes provide evidence of correlation.
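
To make the aggregation in Eq. (3) concrete, here is a minimal Python sketch that sums the pairwise concordance scores over all pairs of reference nodes. The function name `kendall_tau` and the density vectors are hypothetical and chosen only for illustration; the quadratic loop is exactly why sampling becomes necessary for large $N$.

```python
from itertools import combinations


def kendall_tau(s_a, s_b):
    """Eq. (3): sum of pairwise concordance scores over N(N-1)/2 pairs."""
    n = len(s_a)
    total = 0
    for i, j in combinations(range(n), 2):
        prod = (s_a[i] - s_a[j]) * (s_b[i] - s_b[j])
        total += (prod > 0) - (prod < 0)  # c(r_i, r_j) from Eq. (1)
    return total / (0.5 * n * (n - 1))


# Hypothetical density vectors for four reference nodes.
print(kendall_tau([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))  # 1.0
print(kendall_tau([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]))  # -1.0
```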

3.1 The Test 

If $N$ is not large, we can directly compute $\tau(a, b)$ and
judge whether there is a correlation (and how strong) by
$\tau(a, b)$. However, real-life graphs usually have very large
sizes and so does $N$. It is often impractical to compute
$\tau(a, b)$ directly. We propose to sample reference nodes and
perform hypothesis testing [24] to efficiently estimate TESC.
In a hypothesis test, a null hypothesis $H_0$ is tested against
an alternative hypothesis $H_1$. The general process is that
we compute from the sample data a statistic $X$
which has an associated rejection region $C$ such that, if the
measure score falls in $C$, we reject $H_0$; otherwise $H_0$ is not
rejected. The significance level of a test, $\alpha$, is the probability
that $X$ falls in $C$ when $H_0$ is true. The p-value of a test is
the probability of obtaining a value of $X$ at least as extreme
as the one actually observed, assuming $H_0$ is true. In our
case $X$ is $\tau$ and $H_0$ is: events $a$ and $b$ are independent with
respect to $G$'s structure. The test methodology is as follows:
first we choose uniformly a random sample of $n$ reference
nodes from $V_{a \cup b}^h$; then we compute the $\tau$ score over the sampled
reference nodes (denoted by $t(a, b)$):

\[
t(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c(r_{k_i}, r_{k_j})}{\frac{1}{2} n(n-1)}, \tag{4}
\]

where $r_{k_1}, \ldots, r_{k_n}$ are the $n$ sampled reference nodes; finally
we estimate the significance of $t(a, b)$ and reject $H_0$ if the
p-value is less than a predefined significance level. We use
$\mathbf{s}_a^h$ to represent the vector containing densities of $a$ measured
in all $n$ sample reference nodes' $h$-vicinities, where the $i$-th
element is $s_a^h(r_{k_i})$. Under $H_0$, $\tau(a, b)$ is 0. Consequently,
for a uniformly sampled set of reference nodes, any ranking
order of $\mathbf{s}_b^h$ is equally likely for a given order of $\mathbf{s}_a^h$. It
is proved that the distribution of $t(a, b)$ under the null
hypothesis tends to the normal distribution with mean 0 and
variance

\[
\sigma^2 = \frac{2(2n + 5)}{9n(n - 1)}. \tag{5}
\]

The idea of the proof is to show the moments of t's distribu- 
tion under Ho converge to those of the normal distribution, 
and then apply the Second Limit Theorem [9]. Readers 
could refer to Chapter 5 of [15] for details. A good normality
approximation can be obtained when $n > 30$ [15]. When
$s_a^h(r_{k_i}) = s_a^h(r_{k_j})$ or $s_b^h(r_{k_i}) = s_b^h(r_{k_j})$, $c(r_{k_i}, r_{k_j})$ is
0. This means there could be ties of reference nodes where
pairs in a tie show evidence of neither concordance nor
discordance. When ties are present in $\mathbf{s}_a^h$ and/or $\mathbf{s}_b^h$ (often, the
case is that a set of reference nodes only have occurrences of
$a$ or $b$ in their $h$-vicinities), $\sigma^2$ should be modified
accordingly. Let $l$/$m$ be the number of ties in $\mathbf{s}_a^h$/$\mathbf{s}_b^h$. The variance
of the numerator of Eq. (4) becomes [15]:

\[
\begin{aligned}
\sigma_c^2 ={}& \frac{1}{18}\Big[n(n-1)(2n+5) - \sum_{i=1}^{l} u_i(u_i-1)(2u_i+5) - \sum_{i=1}^{m} v_i(v_i-1)(2v_i+5)\Big] \\
&+ \frac{1}{9n(n-1)(n-2)} \Big[\sum_{i=1}^{l} u_i(u_i-1)(u_i-2)\Big]\Big[\sum_{i=1}^{m} v_i(v_i-1)(v_i-2)\Big] \\
&+ \frac{1}{2n(n-1)} \Big[\sum_{i=1}^{l} u_i(u_i-1)\Big]\Big[\sum_{i=1}^{m} v_i(v_i-1)\Big],
\end{aligned}
\tag{6}
\]

where $u_i$ and $v_i$ are the sizes of the $i$-th ties in $\mathbf{s}_a^h$ and $\mathbf{s}_b^h$,
respectively. When these sizes all equal 1, Eq. (6) reduces
to Eq. (5) multiplied by $[\frac{1}{2}n(n-1)]^2$, i.e. the variance of the
numerator of Eq. (4) when no ties exist. By grouping terms
involving $u_i$/$v_i$ together, one can verify that more (larger)
ties always lead to smaller $\sigma_c^2$. $\sigma^2$ is then modified as $\sigma_c^2$
divided by $[\frac{1}{2}n(n-1)]^2$. Once the variance is obtained, we
compute the significance (z-score) of the observed $t(a, b)$ by

\[
z(a, b) = \frac{t(a, b) - E(t(a, b))}{\sqrt{\mathrm{Var}(t(a, b))}} = \frac{t(a, b)}{\sigma}. \tag{7}
\]

For $t$ we do not substitute the alternative normalization
term (see Chapter 3 of [15]) for $\frac{1}{2}n(n-1)$ when ties are
present, since it makes no difference to the significance
result: it simply amounts to dividing both $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} c(r_{k_i}, r_{k_j})$
and $\sigma_c$ by the same normalization term. $t$ is an unbiased
and consistent estimator for $\tau$. In practice, we do not need
to sample too many reference nodes since the variance of $t$
is upper bounded by $\frac{2}{n}(1 - \tau^2)$ [15], regardless of $N$.
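
The test described in this section can be sketched in a few lines of Python. The code below is an illustrative reading of Eqs. (4), (6) and (7), not the authors' implementation: the names `tie_sizes`, `tesc_z_score` and `two_sided_p_value` are invented here, the two-sided p-value is one common choice rather than something the text prescribes, and the `ValueError` guards are additions for degenerate inputs.

```python
import math
from collections import Counter
from itertools import combinations


def tie_sizes(values):
    """Sizes of the groups of equal values (groups of size 1 contribute 0)."""
    return list(Counter(values).values())


def tesc_z_score(s_a, s_b):
    """z-score of Eq. (7), using the tie-corrected variance of Eq. (6)."""
    n = len(s_a)
    if n < 3:
        raise ValueError("Eq. (6) needs at least 3 reference nodes")
    # Numerator of Eq. (4): concordant pairs minus discordant pairs.
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (s_a[i] - s_a[j]) * (s_b[i] - s_b[j])
        s += (prod > 0) - (prod < 0)
    u, v = tie_sizes(s_a), tie_sizes(s_b)
    term1 = (n * (n - 1) * (2 * n + 5)
             - sum(x * (x - 1) * (2 * x + 5) for x in u)
             - sum(x * (x - 1) * (2 * x + 5) for x in v)) / 18
    term2 = (sum(x * (x - 1) * (x - 2) for x in u)
             * sum(x * (x - 1) * (x - 2) for x in v)) / (9 * n * (n - 1) * (n - 2))
    term3 = (sum(x * (x - 1) for x in u)
             * sum(x * (x - 1) for x in v)) / (2 * n * (n - 1))
    var_c = term1 + term2 + term3
    if var_c <= 0:
        raise ValueError("one density vector is constant; no test is possible")
    # Dividing s and sqrt(var_c) by n(n-1)/2 cancels, so z = s / sigma_c.
    return s / math.sqrt(var_c)


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical densities for four reference nodes, no ties.
z = tesc_z_score([0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.5, 0.9])
print(z, two_sided_p_value(z))
```

With no ties, `var_c` equals Eq. (5) multiplied by $[\frac{1}{2}n(n-1)]^2$, so the sketch agrees with the untied case. The toy sample above is far smaller than the $n > 30$ suggested for a good normal approximation and is only meant to show the call pattern.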

3.2 Reference Nodes 

Given the occurrences of two events a and b on graph G, 
not all nodes in G are eligible to be reference nodes for the 
correlation estimation between a and b. We do not con- 
sider areas on G where we cannot "see" any occurrences of 
a or b. That is, we do not consider nodes whose $h$-vicinities
do not contain any occurrence of a or b. We refer to this 
kind of nodes as out-of-sight nodes. The reasons are: (1) 
we measure the correlation of presence, but not the corre- 
lation of absence. The fact that an area does not contain a 
and b currently does not mean it will never have a and/or 
b in the future; (2) if we incorporate out-of-sight nodes into 
our reference set, we could get unexpected high z-scores, 
since in that case we take the correlation of absence into 
account. Out-of-sight nodes introduce two 0-ties containing
the same set of nodes into $\mathbf{s}_a^h$ and $\mathbf{s}_b^h$, respectively. As
shown in the toy example of Figure 3, the two 0-ties contain
$r_6$ through $r_9$. Adding $r_6$ through $r_9$ to the reference
set can only increase the number of concordant pairs, thus
increasing $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} c(r_{k_i}, r_{k_j})$. Moreover, the variance
of $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} c(r_{k_i}, r_{k_j})$ under the null hypothesis is
relatively reduced (Eq. (6)). These two factors tend to lead to
an overestimated z-score. Therefore, given two events $a$ and
$b$, we treat $V_{a \cup b}^h$ as the set of all reference nodes for assessing
level-$h$ TESC between $a$ and $b$. It means we should sample
reference nodes within $V_{a \cup b}^h$, otherwise we would get
out-of-sight nodes. This is different from the spatial point pattern
correlation problem where point patterns are assumed to be
isotropic and we can easily identify and focus on regions
containing points. In the next section, we study how we can
do reference node sampling efficiently.

         r_1  r_2  r_3  r_4  r_5 | r_6  r_7  r_8  r_9
s_a = [  0.0, 0.3, 0.1, 0.0, 0.4 | 0.0, 0.0, 0.0, 0.0 ]
s_b = [  0.4, 0.6, 0.0, 0.7, 0.8 | 0.0, 0.0, 0.0, 0.0 ]

Figure 3: $\mathbf{s}_a^h$ and $\mathbf{s}_b^h$ when we incorporate nodes whose
$h$-vicinities do not contain any occurrence of $a$ or $b$.
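
The inflation effect can be checked numerically on the Figure 3 densities. The sketch below is an illustration under stated assumptions: the helper names `numerator` and `sigma_c` are invented here, the densities are read off Figure 3 with $r_6$ through $r_9$ taken as the out-of-sight nodes, and the sample is far too small for the normal approximation, so only the direction of the change is meaningful.

```python
import math
from collections import Counter
from itertools import combinations


def numerator(s_a, s_b):
    """Numerator of Eq. (4): concordant minus discordant pairs."""
    total = 0
    for i, j in combinations(range(len(s_a)), 2):
        prod = (s_a[i] - s_a[j]) * (s_b[i] - s_b[j])
        total += (prod > 0) - (prod < 0)
    return total


def sigma_c(s_a, s_b):
    """Square root of the tie-corrected variance in Eq. (6)."""
    n = len(s_a)
    u = list(Counter(s_a).values())
    v = list(Counter(s_b).values())
    f = lambda t: sum(x * (x - 1) * (2 * x + 5) for x in t)
    g = lambda t: sum(x * (x - 1) * (x - 2) for x in t)
    k = lambda t: sum(x * (x - 1) for x in t)
    var_c = ((n * (n - 1) * (2 * n + 5) - f(u) - f(v)) / 18
             + g(u) * g(v) / (9 * n * (n - 1) * (n - 2))
             + k(u) * k(v) / (2 * n * (n - 1)))
    return math.sqrt(var_c)


# Densities of r_1..r_5 as listed in Figure 3.
s_a = [0.0, 0.3, 0.1, 0.0, 0.4]
s_b = [0.4, 0.6, 0.0, 0.7, 0.8]
# Appending the four out-of-sight nodes r_6..r_9 (all densities 0).
s_a_all = s_a + [0.0] * 4
s_b_all = s_b + [0.0] * 4

for label, x, y in [("in-sight only", s_a, s_b),
                    ("with out-of-sight", s_a_all, s_b_all)]:
    print(label, numerator(x, y), numerator(x, y) / sigma_c(x, y))
```

On this data the numerator rises from 3 to 11 once the out-of-sight nodes are included, and the resulting z-score roughly doubles (from about 0.76 to about 1.51), in line with the overestimation described above.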



4. REFERENCE NODE SAMPLING 

In this section we present efficient algorithms for sampling
reference nodes from $V_{a \cup b}^h$. We need to know which nodes
are within $V_{a \cup b}^h$, but only have $V_{a \cup b}$ in hand. For continuous
spaces, we can perform range search efficiently by building
R-tree [3] or k-d tree [5] indices. However, for graphs it is
difficult to build efficient index structures for answering range
queries, e.g. querying for all nodes in one node's $h$-vicinity.
Pre-computing and storing pairwise shortest distances is not
practical either, since it requires $O(|V|^2)$ storage. In the
following, we first propose an approach which employs BFS
to retrieve all nodes in $V_{a \cup b}^h$ and then randomly chooses $n$
nodes from $V_{a \cup b}^h$. Then we present efficient sampling
algorithms which avoid enumerating all nodes in $V_{a \cup b}^h$. Finally,
we analyze the time complexity of these algorithms.

4.1 Batch_BFS 

The most straightforward method for obtaining a uniform
sample of reference nodes is to first obtain $V_{a \cup b}^h$, and then
simply sample from it. $V_{a \cup b}^h$ can be obtained by performing
an $h$-hop BFS search from each node $v \in V_{a \cup b}$ and taking
the union of the results. However, this strategy would perform poorly
since the worst case time complexity is $O(|V_{a \cup b}|(|V| + |E|))$.
The problem is that the $h$-vicinities of nodes in $V_{a \cup b}$ could
have many overlaps. Therefore, we adopt a variant of $h$-hop
BFS search which starts with all nodes in $V_{a \cup b}$ as source
nodes. For clarity, we show the algorithm Batch_BFS in
Algorithm 1. It is similar to the $h$-hop BFS algorithm for one
source node, except that the queue Queue is initialized with
a set of nodes. The correctness of Batch_BFS can be easily
verified by imagining that we do an $(h + 1)$-hop BFS from
a virtual node which is connected to all nodes in $V_{a \cup b}$. By
means of Batch_BFS, the worst case time complexity is
reduced from $O(|V_{a \cup b}|(|V| + |E|))$ to $O(|V| + |E|)$, which means
for each node in the graph we examine its adjacency list
at most once. As we will show in experiments, though simple,
Batch_BFS is a competitive method for reference node
selection.
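
A compact Python rendering of the multi-source idea is sketched below. It is a variant rather than a transcription of Algorithm 1: nodes at depth $h$ are enqueued but never expanded, which yields the same output set. The function name `batch_bfs` and the adjacency-dict representation are assumptions for this example.

```python
from collections import deque


def batch_bfs(adj, event_nodes, h):
    """Return all nodes within h hops of at least one event node."""
    depth = {v: 0 for v in event_nodes}  # every event node is a source
    queue = deque(event_nodes)
    while queue:
        v = queue.popleft()
        if depth[v] == h:  # do not expand beyond h hops
            continue
        for u in adj[v]:
            if u not in depth:  # each node is discovered at most once
                depth[u] = depth[v] + 1
                queue.append(u)
    return set(depth)


# Toy path graph 0-1-2-3-4-5-6 with event nodes at both ends (hypothetical).
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(sorted(batch_bfs(adj, {0, 6}, 2)))  # [0, 1, 2, 4, 5, 6]
```

Because each node enters the queue at most once, every adjacency list is scanned at most once, matching the $O(|V| + |E|)$ bound stated above.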

4.2 Importance Sampling 

Though the Batch_BFS algorithm is efficient in that its worst
case time cost is linear in the number of nodes plus the number
of edges in the graph, it still enumerates all $N$ reference
nodes. In practice, the sample size $n$ is usually much smaller
than $N$ and can be treated as a constant since we can fix
$n$ for testing different pairs of events. Hence, the question
is, can we develop reference node selection algorithms which
have time costs depending on $n$, rather than $N$?

Algorithm 1: Batch_BFS

Input: Adj_lists: adjacency lists for all nodes in $G$, $V_{a \cup b}$:
       the set of all event nodes, $h$: # of hops
Output: $V_{out}$: all nodes in the $h$-vicinity of $V_{a \cup b}$
begin
    Initialize $V_{out} = \emptyset$.
    Initialize queue Queue with all $v \in V_{a \cup b}$ and set
    v.depth = 0.
    while Queue is not empty do
        v = Dequeue(Queue)
        foreach u in Adj_lists(v) do
            if $u \notin V_{out}$ and $u \notin$ Queue then
                u.depth = v.depth + 1
                if u.depth = h then
                    $V_{out} = V_{out} \cup \{u\}$
                else
                    Enqueue(Queue, u)
                end
            end
        end
        $V_{out} = V_{out} \cup \{v\}$
    end
end

Procedure RejectSamp($V_{a \cup b}$)

1 Select a node $v \in V_{a \cup b}$ with probability $|V_v^h| / N_{sum}$.
2 Sample a node $u$ from $V_v^h$ uniformly.
3 Get the number of event nodes in $u$'s $h$-vicinity:
  $c = |V_u^h \cap V_{a \cup b}|$.
4 Flip a coin with success probability $\frac{1}{c}$. Accept $u$ if we
  succeed, otherwise a failure occurs.




Figure 4: $h$-vicinities of event nodes.

The idea is that we directly sample reference nodes without
first enumerating the whole set of reference nodes. It is
challenging since we want to sample from the uniform
probability distribution over $V_{a \cup b}^h$, but only have $V_{a \cup b}$ in hand.
The basic operation is randomly picking an event node in
$V_{a \cup b}$ and peeking at its $h$-vicinity. It is not easy to achieve
uniform sampling. On one hand, the $h$-vicinities of event
nodes could have many overlapped regions, as illustrated by
Figure 4. Circles represent $h$-vicinities of the corresponding
nodes and shadowed regions are overlaps. Nodes in overlapped
regions are more likely to be selected if we sample nodes
uniformly from a random event node's $h$-vicinity. On the
other hand, different nodes have $h$-vicinities with different
node set sizes, i.e. $|V_v^h|$, conveyed by circle sizes in Figure 4.
If we pick event nodes uniformly at random, nodes in small
circles tend to have higher probabilities of being chosen.



We can use rejection sampling [10] to achieve uniform
sampling in $V_{a \cup b}^h$, if we know $|V_v^h|$ for each $v \in V_{a \cup b}$. Let
$N_{sum} = \sum_{v \in V_{a \cup b}} |V_v^h|$ be the sum of node set sizes of all
event nodes' $h$-vicinities. It is easy to verify $N_{sum} \geq N$
due to overlaps. The sampling procedure is shown in
Procedure RejectSamp. Proposition 1 shows that RejectSamp
generates samples from the uniform probability distribution
over $V_{a \cup b}^h$. The $|V_v^h|$'s ($h = 1, \ldots, h_m$) can be pre-computed
offline by doing an $h_m$-hop BFS from each node in the graph.
The space cost is only $O(|V|)$ for each vicinity level and once
we obtain the index, it can be efficiently updated as the
graph changes. The time cost depends on $|V|$ and the average
size of node $h_m$-vicinities, i.e. average $|V_v^{h_m}| + |E_v^{h_m}|$.
Fortunately, we do not need to consider too high values of
$h$ since (1) correlations of too broad scales usually do not
convey useful information and (2) in real networks like social
networks, increasing $h$ would quickly let a node's $h$-vicinity
cover a large fraction of the network due to the "small world"
phenomenon of real-life networks [2]. Therefore, we focus on
relatively small $h$ values, such as $h = 1, 2, 3$.

Proposition 1. RejectSamp generates each node in $V_{a \cup b}^h$
with equal probability.

Proof. Consider an arbitrary node $u \in V_{a \cup b}^h$. In step
2 of RejectSamp, $u$ has a chance to be sampled if a node
$v \in V_u^h \cap V_{a \cup b}$ is selected in step 1. Thus, the probability that
$u$ is generated after step 2 is

\[
\sum_{v \in V_u^h \cap V_{a \cup b}} \frac{|V_v^h|}{N_{sum}} \times \frac{1}{|V_v^h|} = \frac{|V_u^h \cap V_{a \cup b}|}{N_{sum}}.
\]

This is a non-uniform probability distribution
over $V_{a \cup b}^h$. Then by the discount in step 4, $u$ is finally
generated with probability $\frac{1}{N_{sum}}$, which is independent of $u$. □

Each run of RejectSamp incurs a cost of two $h$-hop BFS
searches (steps 2 and 3). Simply repeating RejectSamp until
$n$ reference nodes are obtained will generate a uniform sample
of reference nodes. However, each run of RejectSamp
could fail. The success probability of a run of RejectSamp is
$p_{succ} = N / N_{sum}$, which can be easily derived by aggregating
success probabilities of all nodes in $V_{a \cup b}^h$. When there is
no overlap among event nodes' $h$-vicinities, $p_{succ} = 1$ since
$N_{sum} = N$. The expected time cost in terms of $h$-hop BFS
is $2n / p_{succ}$. It means the heavier the overlap among different
event nodes' $h$-vicinities is, the higher the cost is. Considering
the "small world" property of real-life networks [2], it
would be easy to get a heavy overlap as $|V_{a \cup b}|$ and $h$ grow.
Preliminary experiments confirm RejectSamp is inefficient.
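
The four steps of RejectSamp can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names `h_vicinity` and `reject_samp`, the adjacency-dict graph, and the toy path graph are assumptions, and the vicinity sizes are computed on the fly here rather than read from a precomputed offline index.

```python
import random
from collections import deque


def h_vicinity(adj, source, h):
    """Node set within h hops of `source` (h-hop BFS)."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if depth[v] == h:
            continue
        for u in adj[v]:
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return set(depth)


def reject_samp(adj, event_nodes, vicinity_size, h, rng):
    """One run of RejectSamp; returns a reference node, or None on failure."""
    nodes = list(event_nodes)
    # Step 1: pick an event node v with probability |V_v^h| / N_sum.
    v = rng.choices(nodes, weights=[vicinity_size[x] for x in nodes], k=1)[0]
    # Step 2: pick u uniformly from v's h-vicinity.
    u = rng.choice(list(h_vicinity(adj, v, h)))
    # Step 3: count event nodes in u's h-vicinity (c >= 1 because v is one).
    c = len(h_vicinity(adj, u, h) & event_nodes)
    # Step 4: accept u with probability 1/c.
    return u if rng.random() < 1.0 / c else None


# Toy path graph 0-1-2-3-4 with event nodes 1 and 3 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
events = {1, 3}
sizes = {v: len(h_vicinity(adj, v, 1)) for v in events}
rng = random.Random(0)
runs = [reject_samp(adj, events, sizes, 1, rng) for _ in range(20000)]
accepted = [u for u in runs if u is not None]
print(len(accepted) / len(runs))  # close to N / N_sum = 5/6 on this graph
```

On this toy graph node 2 lies in both event nodes' vicinities, so it is proposed twice as often as the others and accepted half the time, which evens out the accepted sample.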

We propose a weighting technique to address the above
problem. The idea is similar to importance sampling [13]. In
particular, we use the same sampling scheme as RejectSamp
except that we do not reject any sampled nodes. This
leads to samples generated from the nonuniform distribution
$P = \{p(v)\}_{v \in V_{a \cup b}^h}$ where $p(v) = |V_v^h \cap V_{a \cup b}| / N_{sum}$. Notice
that $t(a, b)$ is intrinsically an estimator of the real correlation
score $\tau(a, b)$. The idea is, if we can derive a proper
estimator for $\tau(a, b)$ based on samples from $P$, we could use
it as a surrogate for $t(a, b)$. Let $S = \{(r_1, w_1), \ldots, (r_n, w_n)\}$
be a set consisting of $n$ distinct reference nodes sampled
from $P$, where $w_i$ is the number of times $r_i$ is generated in
the sampling process. We denote the sample size of $S$ as
$n' = \sum_{i=1}^{n} w_i$. We define a new estimator for $\tau(a, b)$ based
on $S$:

\[
\tilde{t}(a, b) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c(r_i, r_j) \frac{w_i w_j}{p(r_i) p(r_j)}}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{w_i w_j}{p(r_i) p(r_j)}}. \tag{8}
\]

Algorithm 2: Importance sampling

Input: $V_{a \cup b}$: the set of all event nodes, $|V_v^h|$: $h$-vicinity
       node set sizes for all $v \in V_{a \cup b}$, $h$: # of hops
Output: $S$: a set of $n$ sampled reference nodes, $W$: the set
        of weights (frequencies) for each $r \in S$

 1 begin
 2     Initialize $S = \emptyset$.
 3     while $|S| < n$ do
 4         Randomly select a node $v \in V_{a \cup b}$ with probability
           $|V_v^h| / N_{sum}$.
 5         Do an $h$-hop BFS search from $v$ to get $V_v^h$ and sample
           a node $r$ from $V_v^h$ uniformly.
 6         if $r \in S$ then
 7             $W(r) = W(r) + 1$
 8         else
 9             $S = S \cup \{r\}$
10             $W(r) = 1$
11         end
12     end
13 end



This estimator is a consistent estimator of $\tau(a, b)$, as
proved in Theorem 1.

Theorem 1. $\tilde{t}(a, b)$ is a consistent estimator of $\tau(a, b)$.

Proof. To prove that $\tilde{t}(a, b)$ is a consistent estimator of $\tau(a, b)$,
we need to show that $\tilde{t}(a, b) \xrightarrow{P} \tau(a, b)$, i.e. $\tilde{t}(a, b)$ converges
to $\tau(a, b)$ in probability as the sample size $n' \to \infty$. For each
$r_i$, we define a Bernoulli random variable $X_{r_i}$ which is 1 if a
run of sampling from $P$ outputs node $r_i$, and 0 otherwise. $w_i / n'$
is the sample mean for $X_{r_i}$. By the Law of Large Numbers,
as $n' \to \infty$, $w_i / n'$ converges in probability to the expectation
$E(X_{r_i}) = p(r_i)$. Moreover, all nodes in $V_{a \cup b}^h$ will be added
into $S$ when $n' \to \infty$, which means $n = N$. Therefore, as
$n' \to \infty$, we can obtain:

\[
\tilde{t}(a, b)
= \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j) \frac{(w_i / n')(w_j / n')}{p(r_i) p(r_j)}}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{(w_i / n')(w_j / n')}{p(r_i) p(r_j)}}
\xrightarrow{P}
\frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c(r_i, r_j) \frac{p(r_i) p(r_j)}{p(r_i) p(r_j)}}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{p(r_i) p(r_j)}{p(r_i) p(r_j)}}
= \tau(a, b),
\]

which completes the proof. □



It is easy to verify that $\tilde{t}(a, b)$ is a biased estimator by
considering a toy problem and enumerating all possible outputs
of a sample of size $n'$ (together with their probabilities)
to compute $E(\tilde{t}(a, b))$. However, although unbiasedness used
to receive much attention, nowadays it is considered less
important [24]. We will empirically demonstrate that $\tilde{t}(a, b)$ can
achieve acceptable performance in experiments. For clarity,
we show the Importance sampling algorithm in Algorithm 2.
In each iteration of the sampling loop, the major cost is one
$h$-hop BFS search (line 5). The number of iterations $n'$,
though $\geq n$, is typically $\approx n$ in practice. This is because
when $N$ is large, the probability of selecting the same node
in different iterations is very low. Thus, the major cost of
Importance sampling could be regarded as depending on $n$.
Once $S$ and $W$ are obtained, we can then compute $\tilde{t}(a, b)$ as
a surrogate for $t(a, b)$ and assess the significance accordingly.
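
The sampling loop of Algorithm 2 and the weighted estimator of Eq. (8) can be sketched together in Python. The names `importance_sample` and `weighted_tau`, the dict-based bookkeeping, and the toy graph are assumptions for this illustration; the sketch also assumes $n \leq N$, since otherwise the loop would never collect $n$ distinct nodes.

```python
import random
from collections import deque
from itertools import combinations


def h_vicinity(adj, source, h):
    """Node set within h hops of `source` (h-hop BFS)."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if depth[v] == h:
            continue
        for u in adj[v]:
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    return set(depth)


def importance_sample(adj, event_nodes, vicinity_size, h, n, rng):
    """Algorithm 2 sketch: returns {reference node: weight}; needs n <= N."""
    nodes = list(event_nodes)
    weights = [vicinity_size[v] for v in nodes]
    w = {}
    while len(w) < n:
        v = rng.choices(nodes, weights=weights, k=1)[0]
        r = rng.choice(list(h_vicinity(adj, v, h)))
        w[r] = w.get(r, 0) + 1  # no rejection: repeats raise the weight
    return w


def weighted_tau(w, p, s_a, s_b):
    """Eq. (8); w, p, s_a and s_b are dicts keyed by reference node."""
    num = den = 0.0
    for ri, rj in combinations(list(w), 2):
        weight = w[ri] * w[rj] / (p[ri] * p[rj])
        prod = (s_a[ri] - s_a[rj]) * (s_b[ri] - s_b[rj])
        num += weight * ((prod > 0) - (prod < 0))
        den += weight
    return num / den


# Toy path graph 0-1-2-3-4, event a on node 1, event b on node 3 (hypothetical).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
V_a, V_b = {1}, {3}
events = V_a | V_b
sizes = {v: len(h_vicinity(adj, v, 1)) for v in events}
n_sum = sum(sizes.values())
w = importance_sample(adj, events, sizes, 1, 4, random.Random(0))
vic = {r: h_vicinity(adj, r, 1) for r in w}
p = {r: len(vic[r] & events) / n_sum for r in w}
s_a = {r: len(vic[r] & V_a) / len(vic[r]) for r in w}
s_b = {r: len(vic[r] & V_b) / len(vic[r]) for r in w}
print(weighted_tau(w, p, s_a, s_b))
```

Since $N_{sum}$ appears in every $p(r_i) p(r_j)$ factor of both the numerator and the denominator, it cancels in the ratio; it is kept here only to mirror the definition of $p(v)$.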

Improving Importance Sampling Although the time
cost of Importance sampling depends on $n$ rather than $N$, in
practice $n$ $h$-hop BFS searches could still be slower than one
Batch_BFS search as $h$ increases. This is because the overlap
among different event nodes' $h$-vicinities tends to become
heavier as $h$ increases. We can alleviate this issue by sampling
reference nodes in a batch fashion. That is, when $V_v^h$ is
obtained for a sampled $v \in V_{a \cup b}$ (line 5 of Algorithm 2), we
sample more than one reference node from $V_v^h$. In this way,
the ratios between different reference nodes' probabilities of
being chosen do not change. However, this also introduces
dependence into $S$. Sampling too many nodes from one $V_v^h$
would degrade performance since the number of event nodes
peeked at decreases and consequently we are more likely to
be trapped in local correlations. This is a tradeoff between
efficiency and accuracy. We will test this approximation idea
in experiments.

Algorithm 3: Whole graph sampling

Input: $V_{a \cup b}$: the set of all event nodes, $V$: all nodes in
       graph $G$
Output: $S$: a set of $n$ sampled reference nodes

 1 begin
 2     $S = \emptyset$
 3     while $|S| < n$ do
 4         Randomly pick a node $v \in V$
 5         Do an $h$-hop BFS search from $v$ to get $V_v^h$
 6         if $V_v^h \cap V_{a \cup b} \neq \emptyset$ then
 7             $S = S \cup \{v\}$
 8         end
 9         $V = V - \{v\}$
10     end
11 end

4.3 Global Sampling in Whole Graph 

When $|V_{a \cup b}|$ and $h$ increase, the chance that a random node
selected from the whole graph is in $V^h_{a \cup b}$ also increases. In
this situation, we can simply sample nodes uniformly in the whole graph,
and the obtained nodes which are within $V^h_{a \cup b}$ can be regarded
as a uniform sample from $V^h_{a \cup b}$. We use an iterative process to
harvest reference nodes: (1) first, a node is chosen uniformly from the
whole graph; (2) we test whether the selected node is within
$V^h_{a \cup b}$; (3) if it is in $V^h_{a \cup b}$, we keep it; (4)
another node is selected uniformly from the remaining nodes and we go to
step 2. This process continues until $n$ reference nodes are collected.
For completeness, the Whole graph sampling algorithm is shown in
Algorithm 3. The major cost is incurred by one h-hop BFS search in each
iteration (line 5), where the purpose is to examine whether $v$ is an
eligible reference node.
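The iterative process above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the graph is held as a dictionary of adjacency lists and the event nodes as a set; the names `h_vicinity` and `whole_graph_sampling` are hypothetical. Scanning a shuffled node list is used here as an equivalent of repeatedly drawing from the remaining nodes.

```python
import random
from collections import deque


def h_vicinity(adj, source, h):
    """Return the set of nodes within h hops of source (source included)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue  # do not expand beyond h hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def whole_graph_sampling(adj, event_nodes, h, n, rng=None):
    """Collect up to n reference nodes by rejection sampling over all nodes.

    adj: dict mapping each node to a list of neighbours
    event_nodes: set of nodes carrying event a or event b
    """
    rng = rng or random.Random()
    candidates = list(adj)
    rng.shuffle(candidates)  # shuffled scan = uniform draws without replacement
    sample = []
    for v in candidates:
        if len(sample) == n:
            break
        # v is eligible if at least one event node lies in its h-vicinity
        if h_vicinity(adj, v, h) & event_nodes:
            sample.append(v)
    return sample
```

If fewer than `n` eligible nodes exist, the sketch returns all of them rather than looping forever, which is a design choice of this illustration rather than something stated in Algorithm 3.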

4.4 Complexity Analysis 

The major space cost is $O(|E|)$, for storing the graph as adjacency
lists. Regarding time complexity, we have mainly three phases: reference
node sampling, event density computation (Eq. (2)) and measure
computation (z-score, Eq. (7)). Let $c_B$ be the average cost of one
h-hop BFS search on graph $G$, which is linear in the average size of
node h-vicinities, i.e. average $|V_v^h| + |E_v^h|$. Let $n$ be the
number of sample reference nodes. The event density computation for a
reference node has time complexity $O(c_B)$. The cost of z-score
computation is $O(n^2)$. Fortunately, we do not need to select too many
reference nodes, as discussed in Section 3.1. We will demonstrate the
efficiency of the above two phases in experiments.
For reference node sampling, we have three methods. The time complexity
of Batch_BFS is $O(|V^h_{a \cup b}| + |E^h_{a \cup b}|)$, where
$|V^h_{a \cup b}| = N$. The cost of Importance sampling is $O(n c_B)$.
For Whole graph sampling, the time cost is $O(n_f c_B)$, where $n_f$ is
the number of nodes examined which are not in $V^h_{a \cup b}$. The cost
incurred by examined nodes which are in $V^h_{a \cup b}$ is counted in
the event density computation phase. $n_f$ is a random variable.
Treating Whole graph sampling as sampling with replacement, the
probability of selecting a node in $V^h_{a \cup b}$ in each iteration is
$N/|V|$. The expected total number of iterations is $n|V|/N$ and
therefore $E(n_f) = n|V|/N - n$. When $N$ is small, Batch_BFS can be
used. For large $N$, Importance sampling and Whole graph sampling are
better candidates. We will empirically analyze their efficiency in the
experiments.
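To make the expected cost of Whole graph sampling concrete, the following small helper evaluates the two expressions above, $n|V|/N$ and $E(n_f) = n|V|/N - n$. The function name and argument names are hypothetical.

```python
def expected_whole_graph_cost(n, num_nodes, vicinity_size):
    """Return (expected iterations, expected rejected nodes E(n_f)).

    n: number of reference nodes wanted
    num_nodes: |V|, the number of nodes in the graph
    vicinity_size: N, the number of eligible reference nodes
    """
    if not 0 < vicinity_size <= num_nodes:
        raise ValueError("vicinity_size must be in (0, num_nodes]")
    iterations = n * num_nodes / vicinity_size
    return iterations, iterations - n
```

For illustration only, with $n = 900$ and the DBLP node count $|V| = 964{,}677$, an eligible set of $N = 700{,}000$ nodes gives about 1,240 expected iterations, whereas $N = 60{,}000$ gives about 14,470. This is why Whole graph sampling is attractive only when $N$ is a large fraction of $|V|$.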

5. EXPERIMENTS 

This section presents the experimental results of applying 
our proposed TESC testing framework on several real world 
graph datasets. Firstly, we verify the efficacy of the pro- 
posed TESC testing framework by event simulation on the 
DBLP graph. Then we examine the efficiency and scala- 
bility of the framework with a Twitter network. The third 
part of experiments concentrates on analyzing highly cor- 
related real event pairs discovered by our measure in real 
graph datasets. All experiments are run on a PC with Intel 
Core i7 CPU and 12GB memory. 

5.1 Graph Datasets 

We use three datasets to evaluate our TESC testing frame- 
work: DBLP, Intrusion and Twitter. 

DBLP The DBLP dataset was downloaded on Oct. 16th, 
2010 (http://www.informatik.uni-trier.de/~ley/db). Its pa- 
per records were parsed to obtain the co-author social net- 
work. Keywords in paper titles are treated as events associ- 
ated with nodes (authors) on the graph. The DBLP graph 
contains 964,677 nodes and 3,547,014 edges. In total, it has
around 0.19 million keywords.

Intrusion The Intrusion dataset was derived from the log 
data of intrusion alerts in a computer network. It has 200,858 
nodes and 703,020 edges. There are 545 different types of 
alerts which are treated as events in this network. 

Twitter The Twitter dataset is a bidirectional subgraph of the
whole Twitter network (http://twitter.com) with 20 million nodes
and 0.16 billion edges. We do not have events for this dataset.
It is used to test the scalability of the proposed TESC testing
framework.

5.2 Event Simulation 

A suitable method for evaluating the efficacy of our approach is to
simulate correlated events on graphs and see if we can correctly detect
correlations. Specifically, we adopt methodologies similar to those used
in the analogous point pattern problem [7] to generate pairs of events
with positive and negative correlations on graphs. The DBLP network is
used as the test bed. We investigate correlations with respect to
different vicinity levels $h = 1, 2, 3$. Positively correlated event
pairs are generated in a linked pair fashion: we randomly select 5000
nodes from the graph as event a, and each node $v \in V_a$ has an
associated event b node whose distance to $v$ is described by a Gaussian
distribution with mean zero and variance equal to $h$ (distances beyond
$h$ are set to $h$). Once the distance is decided, we randomly pick a
node at that distance from $v$ as the associated event b node. This
represents strong positive correlation, since wherever we observe an
event a there is always a nearby event b. For negative correlation, we
again first generate 5000 event a nodes randomly, after which we employ
Batch_BFS to retrieve the nodes in the h-vicinity of $V_a$, i.e.
$V_a^h$. Then we randomly color 5000 nodes in $V \setminus V_a^h$ as
having event b. In this way, every node of b is kept at least $h + 1$
hops away from all nodes of a and the two events exhibit a strong
negative correlation. For each vicinity level, we generate 100 positive
event pairs and 100 negative event pairs from the above simulation
processes, respectively. We use recall as the evaluation metric, defined
as the number of correctly detected event pairs divided by the total
number of event pairs (100). We report results obtained from one-tailed
tests with significance level $\alpha = 0.05$. In our experiments, we
empirically set the sample size of reference nodes to $n = 900$.
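The two simulation procedures can be sketched as follows. This is a hedged illustration rather than the exact generator used in the experiments: the graph is assumed to be a dictionary of adjacency lists with sortable node ids, the Gaussian draw is turned into a hop count by taking its absolute value and rounding (the text does not say how the distance is discretized), and a per-node BFS stands in for Batch_BFS. All function names are hypothetical.

```python
import random
from collections import deque


def bfs_levels(adj, source, h):
    """Group nodes by exact hop distance (0..h) from source."""
    levels = {0: [source]}
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                levels.setdefault(dist + 1, []).append(nbr)
                frontier.append((nbr, dist + 1))
    return levels


def positive_pair(adj, h, num_a, rng):
    """Linked-pair simulation: each event a node gets a nearby event b node."""
    event_a = rng.sample(sorted(adj), num_a)
    event_b = set()
    for v in event_a:
        levels = bfs_levels(adj, v, h)
        # Assumption: |N(0, variance=h)| rounded to a hop count, capped at h.
        d = min(h, round(abs(rng.gauss(0.0, h ** 0.5))))
        # Fall back to the nearest smaller populated level if level d is empty.
        d = max(k for k in levels if k <= d)
        event_b.add(rng.choice(levels[d]))
    return set(event_a), event_b


def negative_pair(adj, h, num_a, num_b, rng):
    """Event b nodes are drawn from outside the h-vicinity of event a."""
    event_a = set(rng.sample(sorted(adj), num_a))
    vicinity = set()
    for v in event_a:  # per-node BFS yields the same set as a batch BFS
        for nodes in bfs_levels(adj, v, h).values():
            vicinity.update(nodes)
    outside = sorted(set(adj) - vicinity)
    return event_a, set(rng.sample(outside, min(num_b, len(outside))))
```

Because several event a nodes may pick the same partner, the event b set returned by `positive_pair` can be slightly smaller than the event a set.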

5.2.1 Performance Comparison

We investigate the performance of the three reference node sampling
algorithms, namely Batch_BFS, Importance sampling and Whole graph
sampling, under different vicinity levels and different noise levels.
Noise is introduced as follows. For positive correlation, we introduce a
sequence of independent Bernoulli trials, one for each linked pair of
event nodes, in which with probability $p$ the pair is broken and the
node of b is relocated outside $V_a^h$. For negative correlation, given
an event pair, each node in $V_b$ has probability $p$ of being relocated
and attached to one node in $V_a$. The probability $p$ controls to what
extent noise is introduced and can be regarded as the noise level.

We show the experimental results in Figures 5 and 6, for positive
correlation and negative correlation, respectively. As can be seen,
overall the performance curves start from 100% and fall off as the noise
level increases. This indicates that the proposed statistical testing
approach is efficacious for measuring TESC. Among the three reference
node sampling algorithms, Batch_BFS achieves relatively better
performance. Importance sampling, though not as good as Batch_BFS, can
also achieve acceptable recall, especially for $h = 1, 2$. We shall show
in Section 5.3 that Importance sampling is more efficient than Batch_BFS
in many cases. Whole graph sampling also shows good recall in most
cases, as expected. However, its running time can vary drastically and
therefore it can only be applied in limited scenarios. An interesting
phenomenon is that positive correlations for higher vicinity levels
(e.g. 3) are harder to break than those for lower levels, while for
negative correlations it is the reverse: lower level ones are harder to
break. Note that the noise level ranges in the subfigures of Figures 5
and 6 are not the same. This phenomenon is intuitive. Consider the size
of $V_a^h$. When $h$ increases, $|V_a^h|$ usually increases
exponentially. For example, among our synthetic events in the DBLP
graph, the typical size of $V_a^1$ is 60k while that of $V_a^3$ is 700k
(7/10 of the whole graph), for $|V_a| = 5000$. Hence, it is much harder
for event b to "escape" event a for higher vicinity levels. On the
contrary, for $h = 1$ it is easier to find a node whose 1-vicinity does
not even overlap with $V_a^h$. Hence, low vicinity level positive
correlations and high vicinity level negative correlations are hard to
maintain and consequently more interesting than those in other cases. In
the following experiment on real events, we will focus on these
interesting cases.

[Figure 5 panels: (a) h = 1, positive; (b) h = 2, positive; (c) h = 3, positive]

Figure 5: Performance of three reference node sampling algorithms on
simulated positively correlated event pairs. Results for various noise
levels are reported under different vicinity levels.

[Figure 6 panels: (a) h = 1, negative; (b) h = 2, negative; (c) h = 3, negative]

Figure 6: Performance of three reference node sampling algorithms on
simulated negatively correlated event pairs. Results for various noise
levels are reported under different vicinity levels.

[Figure 7 plot. Curves: Positive, h=3, noise=0.1; Positive, h=2, noise=0;
Negative, h=3, noise=0; Negative, h=2, noise=0.5. x-axis: # reference
nodes sampled from $V_v^h$ (5 to 20).]

Figure 7: Performance of sampling different numbers of reference nodes
from each $V_v^h$ for Importance sampling.

5.2.2 Batch Importance Sampling 

In Importance sampling, when $V_v^h$ is obtained for a sampled
$v \in V_{a \cup b}$ (line 5 of Algorithm 2), we could sample more than
one node from $V_v^h$ as reference nodes, in order to reduce the cost.
However, sampling too many nodes from one $V_v^h$ would degrade
performance, since the number of event nodes peeked at decreases and
consequently we are more likely to be trapped in local correlations.
Here we present an empirical evaluation of this idea for $h = 2, 3$. We
show results for four synthetic event pair sets in Figure 7. Two of
those sets contain noise since in the corresponding cases the
correlation is hard to break, which means in those cases it is easy to
detect correlations. We can see that the results are as expected. The
performance curves for $h = 3$ stay high over a longer range of the
number of reference nodes sampled from each $V_v^h$, compared to
$h = 2$. This is because 3-vicinities are usually much larger than
2-vicinities and 3-vicinities of event nodes tend to have more
overlapped regions. Therefore, sampling a batch of reference nodes from
3-vicinities is less likely to be trapped in local correlations than
from 2-vicinities. The results also indicate that we can sample a small
number of reference nodes from each $V_v^h$ for Importance sampling
without severely affecting its performance. In the following efficiency
experiments, we set this number to 3 and 6 for $h = 2$ and $h = 3$
respectively.

5.2.3 Impact of Graph Density 

We change the graph density to see the impact on correlation results.
Specifically, we alter the DBLP graph by randomly adding/removing edges
and run Batch_BFS for the six event pair sets (without noise) generated
in Section 5.2. Figure 8 shows the results. We can see that when
removing edges, the recall of positive pairs decreases, while adding
edges leads to a recall decline for negative pairs. In the remaining
cases (e.g. negative pairs vs. edge removal) the recall remains at 1.
This is because removing edges tends to increase distances among nodes
while adding edges brings nodes closer to one another. Figure 8(a) shows
that 1-hop positive correlations are less influenced by edge removal,
which is different from the observation in Section 5.2.1, i.e. that
1-hop positive correlations are easier to break. The reason is that in
our correlation simulation model 1-hop positive event pairs tend to have
more nodes with both events, due to the Gaussian distributed distances
between event b nodes and corresponding event a nodes. Nodes with both
events reflect transaction correlation, which is not influenced by edge
removal. However, TESC does not just measure transaction correlations.
We will show in Section 5.4 that there are real event pairs which
exhibit high positive TESC but are independent or even negatively
correlated by transaction correlation.

Figure 8: Impact of randomly removing or adding edges on the correlation
results.

Table 1: Five keyword pairs exhibiting high 1-hop positive correlation
(DBLP). All scores are z-scores.

#   Pair                     TESC                       TC
                             h = 1    h = 2    h = 3
1   Texture vs. Image        6.22     19.85    30.58    172.7
2   Wireless vs. Sensor      5.99     23.09    32.12    463.7
3   Multicast vs. Network    4.21     18.37    26.66    123.2
4   Wireless vs. Network     2.06     17.41    27.90    198.2
5   Semantic vs. RDF         1.72     16.02    24.94    120.3

5.3 Efficiency and Scalability 

We test the efficiency and scalability of our TESC testing framework on
the Twitter graph. First we investigate the running time of different
reference node sampling algorithms with respect to the number of event
nodes, i.e. the size of $V_{a \cup b}$. In particular, we randomly pick
nodes from the Twitter graph to form $V_{a \cup b}$ with sizes ranging
from 1000 to 500000. Then each algorithm is run to generate sample
reference nodes for these $V_{a \cup b}$'s in order to record its
running time. Results are averaged over 50 test instances for each size
of $V_{a \cup b}$. Figure 9 shows the results for the three vicinity
levels. To keep the figures clear, we do not show the running time of
Whole graph sampling for some cases since its running time goes beyond
10 seconds. We can see that for different vicinity levels the situations
are quite different. Generally speaking, the running time of Batch_BFS
increases significantly as $V_{a \cup b}$ grows, while that of
Importance sampling hardly increases. This is consistent with our
analysis in Section 4.4. The running time of Importance sampling
increases a little because the algorithm tends to choose event nodes
with large $V_v^h$ to peek at in the sampling loop. By chance, there
would be more and more event nodes with large $V_v^h$ as
$V_{a \cup b}$ grows. We can see that Importance sampling is definitely
more efficient than Batch_BFS when $h = 1$. For $h = 2$ and 3, when the
size of $V_{a \cup b}$ is small, we can use Batch_BFS; for large sizes
of $V_{a \cup b}$, Importance sampling is a better choice. Whole graph
sampling is recommended only for $h = 3$ and for large sizes of
$V_{a \cup b}$ (above 200k in the case of the Twitter graph). To
conclude, the results indicate that our reference node sampling
algorithms are efficient and scalable: for example, we can process a
$V_{a \cup b}$ with 500K nodes on a graph with 20M nodes in 1.5s.

Table 2: Five keyword pairs exhibiting high 3-hop negative correlation
(DBLP). All scores are z-scores.

#   Pair                       TESC                        TC
                               h = 1     h = 2     h = 3
1   Texture vs. Java           -23.63    -9.41     -6.40    4.33
2   GPU vs. RDF                -24.47    -14.64    -6.31    1.24
3   SQL vs. Calibration        -21.29    -12.70    -5.45    -0.62
4   Hardware vs. Ontology      -22.31    -8.85     -5.01    3.38
5   Transaction vs. Camera     -22.20    -7.91     -4.26    4.85

Besides reference node sampling, the TESC testing framework also needs
to do one h-hop BFS search for each sample reference node to compute
event densities and then calculate $z(a,b)$. Figure 10 shows that these
two operations are also efficient and scalable. Figure 10(a) indicates
that on a graph with 20 million nodes, one 3-hop BFS search only needs
5.2ms, which is much faster than the state-of-the-art hitting time
approximation algorithm (170ms for 10 million nodes) [11]. Efficiency is
the major reason that we choose this simple density measure, rather than
more complicated proximity measures such as hitting time. On the other
hand, although the measure computation has time complexity $O(n^2)$, we
do not need to select too many reference nodes since the variance of
$t(a,b)$ is upper bounded by $\frac{2}{n}(1 - \tau(a,b)^2)$ [15],
regardless of $N$. Figure 10(b) shows that we can compute $z(a,b)$ in
4ms for 1000 reference nodes.
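The quadratic cost of the measure computation comes from comparing every pair of reference nodes. The sketch below shows one way such a z-score could be computed. It is an assumption-laden illustration, not a transcription of Eq. (7): it uses the textbook null variance of Kendall's statistic for data without ties, $2(2n+5)/(9n(n-1))$, whereas tied densities would call for a tie-corrected variance. The function name is hypothetical.

```python
import math
from itertools import combinations


def kendall_z(dens_a, dens_b):
    """Return (t, z) for paired densities via the O(n^2) pairwise definition."""
    n = len(dens_a)
    if n < 2 or n != len(dens_b):
        raise ValueError("need two equally long sequences with at least 2 entries")
    score = 0
    for i, j in combinations(range(n), 2):
        da = dens_a[i] - dens_a[j]
        db = dens_b[i] - dens_b[j]
        # concordant pairs add 1, discordant pairs subtract 1, ties add 0
        score += ((da > 0) - (da < 0)) * ((db > 0) - (db < 0))
    t = score / (n * (n - 1) / 2)
    # Null variance of t assuming no ties (an assumption of this sketch).
    var0 = 2 * (2 * n + 5) / (9 * n * (n - 1))
    return t, t / math.sqrt(var0)
```

With $n = 900$ reference nodes the loop visits 404,550 pairs, which is small enough that the quadratic term is not the bottleneck in practice.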

5.4 Real Events 

We provide case studies of applying our TESC testing framework to real
events occurring in real graphs. We use Batch_BFS for reference node
selection. As mentioned in Section 5.2.1, low level positive
correlations and high level negative correlations are of interest.
Hence, we report typical highly correlated event pairs we found in the
DBLP and Intrusion datasets in terms of 1-hop positive TESC and 3-hop
negative TESC respectively. We report z-scores as the significance
scores of the correlated event pairs. To give a notion of the
correspondence between z-scores and p-values, a z-score > 2.33 or
< -2.33 indicates that the corresponding p-value is < 0.01 for
one-tailed significance testing. Before presenting the results, we would
like to emphasize that our correlation findings are for specific
networks and that our measure detects the exhibition of correlation, but
not its cause.
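The correspondence between z-scores and one-tailed p-values can be checked with the standard normal tail probability. The helper below is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
import math


def one_tailed_p(z):
    """One-tailed p-value of a standard normal z-score (upper or lower tail)."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))
```

For instance, `one_tailed_p(2.33)` is just below 0.01, in line with the threshold quoted above.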

Tables 1 and 2 show the results for DBLP. For comparison, we also show
correlation scores measured by treating nodes as isolated transactions.
We use Kendall's $\tau_b$ [1] to estimate the Transaction Correlation
(TC) since $\tau_b$ can capture both positive and negative correlations.
All scores in the tables are z-scores. We can see that highly positively
correlated keywords are semantically related and reflect hot research
areas in different communities of computer science, while negatively
correlated ones represent topics which are far away from each other. In
DBLP, keyword pairs having positive TESC tend to also have positive TC.
However, for the negative case the results are not consistent: many
pairs in Table 2 have positive TC. This means that although some authors
have used both keywords, the keywords are far apart in the graph space,
reflecting the fact that they represent quite different topics pursued
by different communities in the co-author social network.

Figure 9: Running time of reference node sampling algorithms with
increasing number of event nodes.

Figure 10: Running time of one h-hop BFS search and $z(a,b)$
computation.

Table 3: Five alert pairs exhibiting high 1-hop positive correlation
(Intrusion). All scores are z-scores.

#   Pair                                                 TESC (h = 1)   TC
1   Ping_Sweep vs. SMB_Service_Sweep                     13.64          1.91
2   Ping_Flood vs. ICMP_Flood                            12.53          5.87
3   Email_Command_Overflow vs. Email_Pipe                12.15          -0.04
4   HTML_Hostname_Overflow vs. HTML_NullChar_Evasion     9.08           0.59
5   Email_Error vs. Email_Pipe                           4.34           -3.52

Table 4: Five alert pairs exhibiting high 2-hop negative correlation
(Intrusion). All scores are z-scores.

#   Pair                                                 TESC (h = 2)   TC
1   Audit_TFTP_Get_Filename vs. LDAP_Auth_Failed         -31.30         -0.81
2   LDAP_Auth_Failed vs. TFTP_Put                        -31.12         -0.81
3   DPS_Magic_Number_DoS vs. HTTP_Auth_TooLong           -30.96         -0.18
4   LDAP_BER_Sequence_Dos vs. TFTP_Put                   -30.30         -1.57
5   Email_Executable_Extension vs. UDP_Service_Sweep     -26.93         -0.97

Results for the Intrusion dataset are presented in Tables 3 and 4. Since
the Intrusion graph contains several nodes with very high degrees
(around 50k), its diameter is much lower than that of DBLP. In the
Intrusion graph, the 2-vicinity of a node tends to cover a large number
of nodes. Therefore, for negative TESC we focus on $h = 2$. As shown in
Table 3, positively correlated alerts reflect high-level intrusion
activities. The first pair reflects pre-attack probes. The second one is
related to ICMP DoS attacks. The third and fifth pairs indicate that the
attacker is trying to gain root access to those hosts through
vulnerabilities in email software and services. The fourth one is
related to Internet Explorer's vulnerabilities. Notice that the third
pair is nearly independent and the fifth pair is negatively correlated
under TC. The reason could be that some attacking techniques consume
bandwidth and there is a tradeoff between the number of hosts attacked
and the number of techniques applied to one host. Attackers might choose
to maximize coverage by alternating related intrusion techniques for
hosts in a subnet, in order to increase the chance of success. Although
these alerts represent related techniques, they do not exhibit positive
TC. TESC can detect such positive structural correlations.

On the other hand, highly negatively correlated alerts are those related
to different attacking approaches, or connected with different
platforms. For example, in the first pair of Table 4, LDAP_Auth_Failed
is related to brute-force password guessing, while
Audit_TFTP_Get_Filename is related to a TFTP attack which allows remote
users to write files to the target system without any authentication; in
the third pair, DPS_Magic_Number_DoS is specific to Microsoft Dynamics
GP software while HTTP_Auth_TooLong is specific to Netscape Enterprise
Server software. These pairs also exhibit moderate negative TC.

We also compare our results with those produced by the proximity pattern
mining method [16] for the positive case. Specifically, we set
$minsup = 10/|V|$ for the pFP algorithm and $\alpha = 1$,
$\epsilon = 0.12$ [16]. Then we run the proximity pattern mining method
on the Intrusion dataset. From the results, we find that most highly
positively correlated pairs detected by TESC are also reported as
proximity patterns, or subsets of proximity patterns. However, some rare
event pairs detected by TESC are not discovered by the proximity pattern
mining method. Table 5 shows two such examples. Digits in parentheses
are event sizes. The reason is that proximity pattern mining is
intrinsically a frequent pattern mining problem [16]. It requires events
to occur not only closely but also frequently closely on the graph. In
TESC there is no such requirement and we can detect positively
correlated rare event pairs.

Table 5: Two rare alert pairs with positive 1-hop TESC which are not
discovered by proximity pattern mining.

Pair (count)                                                      z-score / p-value
HTTP_IE_Script_HRAlign_Overflow (16) vs. HTTP_DotDotDot (29)      3.30 / 0.0005
HTTP_ISA_Rules_Engine_Bypass (81) vs. HTTP_Script_Bypass (12)     2.52 / 0.0059

6. DISCUSSIONS 

A straightforward measure for TESC could be to calculate the average
distance between nodes of the two events. Measures of this kind try to
capture the "distance" between the two events directly. However, for
these direct measures it is difficult to estimate their distributions
under the null hypothesis (i.e. no correlation). An empirical approach
is to use randomization: perturbing events a and b independently in the
graph while keeping the observed sizes and internal structures, and
calculating the empirical distribution of the measure. Unfortunately, it
is hard to preserve each event's internal structure, which makes
randomization ineffective. Our approach avoids randomization by
indirectly measuring the rank correlation between two events' densities
in local neighborhoods of sampled reference nodes. Significance can be
estimated using $\tau$'s property of being asymptotically normal under
the null hypothesis. Our approach provides a systematic way to compute
formal and rigorous statistical significance, rather than an empirical
one.

Another simple idea is to first map nodes in a graph to a Euclidean
space while preserving the structural properties and then apply existing
techniques for spatial data. Nevertheless, (1) techniques for spatial
data are not scalable; (2) mapping introduces approximation errors. For
example, researchers tried to approximate network distances using a
coordinate system [21, 26]. According to the recent work [26], one
distance estimation costs 0.2 μs. Let us take the most recent method for
spatial data [23] as an example. It requires estimating the distances
between each reference point and all event points. Consequently, for
500K event points and 900 reference points, the total time cost is 90s.
Although we could build k-d tree indices [5] to improve efficiency, k-d
trees only work well for low dimensional spaces. Reducing the
dimensionality leads to a higher distance estimation error [26],
indicating a tradeoff between accuracy and efficiency. Our method avoids
these issues and provides a scalable solution over the exact structure.
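The 90s figure follows directly from the numbers quoted in the preceding paragraph, assuming one distance estimation per (reference point, event point) pair at 0.2 μs each:

```latex
\[
  \underbrace{5 \times 10^{5}}_{\text{event points}}
  \times
  \underbrace{900}_{\text{reference points}}
  \times
  \underbrace{0.2\,\mu\text{s}}_{\text{per estimation}}
  = 4.5 \times 10^{8} \times 2 \times 10^{-7}\,\text{s}
  = 90\,\text{s}.
\]
```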

How to choose the sample size of reference nodes is a practical issue.
While there is no theoretical criterion for choosing a proper sample
size, in practice we can run correlation/independence simulations on a
graph (as in Section 5.2) and choose a sample size large enough that the
recall is above a user-defined threshold, e.g. 0.95. Recall is connected
to the Type I and Type II errors in statistical tests, for independence
and correlation respectively.
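The selection procedure described above amounts to a simple search over candidate sample sizes. The sketch below is purely illustrative: `recall_at` is a hypothetical callback that would run the simulations of Section 5.2 for a given sample size and return the measured recall.

```python
def choose_sample_size(candidates, recall_at, threshold=0.95):
    """Return the smallest candidate n whose simulated recall meets threshold.

    candidates: iterable of sample sizes to try
    recall_at: hypothetical callback mapping a sample size n to the recall
               measured on simulated event pairs
    Returns None if no candidate reaches the threshold.
    """
    for n in sorted(candidates):
        if recall_at(n) >= threshold:
            return n
    return None
```

Trying candidates in increasing order keeps the quadratic cost of the measure computation as low as the recall target allows.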

Our method can assess correlations at different vicinity levels, i.e.
$h$. Another scheme could be to get rid of $h$ by designing a weighted
correlation measure where reference nodes closer to event nodes have
higher weights. This is challenging since we cannot directly make use of
$\tau$'s property of being asymptotically normal in the null case.
Another possible extension is to consider event intensity on nodes, e.g.
the frequency with which an author used a keyword. We leave these
possible extensions for future work.

7. RELATED WORK 

Our work is related to a branch of graph mining research which involves
both graph structures and node attributes [27, 8, 20, 22, 16, 11]. Zhou
et al. proposed a graph clustering algorithm based on both structural
and attribute similarities [27]. In [8], Ester et al. also investigated
using node attribute data to improve graph clustering. Moser et al.
introduced the problem of mining cohesive graph patterns, which are
defined as dense and connected subgraphs that have homogeneous node
attribute values [20]. Although the above works considered both graph
structures and node attributes, they did not explicitly study the
relationships between structures and attributes. Recently, Silva et al.
[22] proposed a structural correlation pattern mining problem which aims
to find pairs $(S, V)$ where $S$ is a frequent attribute set and $V$
induces a dense subgraph. Each node in $V$ contains all the attributes
in $S$. However, this kind of correlation is too restrictive and strong.
An attribute which occurs on $|V| - 1$ nodes of $V$ will be discarded,
though it also has a positive correlation with attributes in $S$. In
contrast, our approach allows a user to measure the structural
correlation between any attributes freely. Khan et al. studied the
problem of mining a set of attributes which frequently co-occur in local
neighborhoods in a graph [16]. They also tried to assess the
significance of the discovered patterns. Nevertheless, our problem is
significantly different from theirs, as shown in Sections 1 and 5.4. In
[11], we proposed a measure based on hitting time to assess the
structural correlation within an attribute. Significance of the
correlation is estimated via a normal approximation of the measure's
distribution under the null hypothesis, where variances are estimated by
simulations. However, this measure is not suitable for TESC: if we adapt
the measure to compute the affinity between two events, its distribution
in the null case is difficult to estimate by simulations. It is hard to
preserve each event's internal structure when simulating independence
between them.

Our work is also related to assessing and testing the correlation
between two spatial point patterns in spatial pattern analysis [7, 18,
23]. However, existing solutions for this similar problem cannot be
applied directly to graph spaces for the following reasons: (1)
proximity measures for continuous spaces cannot be applied to graph
spaces directly; (2) the fixed and discrete graph structure renders
infeasible some popular testing methodologies such as randomly shifting
one point pattern around the space [18]; (3) focusing on regions where
points exist and uniformly sampling reference points are trivial tasks
in continuous spaces [23], while they are not in our problem due to the
discrete nature of graphs; (4) in our case, scalability is an important
issue, which existing methods for the point pattern problem failed to
consider.

8. CONCLUSIONS 

We studied the problem of measuring Two-Event Structural Correlations
(TESC) in graphs and proposed a novel measure and an efficient testing
framework to address it. Given the occurrences of two events, we choose
uniformly a sample of reference nodes from the vicinity of all event
nodes and compute for each reference node the densities of the two
events in its vicinity respectively. Then we employ the Kendall's
$\tau$ rank correlation measure to compute the average concordance of
density changes for the two events, over all pairs of reference nodes.
Correlation significance can then be assessed using $\tau$'s property of
being asymptotically normal under the null hypothesis. We also proposed
three different algorithms for efficiently sampling reference nodes.
Another rank correlation statistic, Spearman's $\rho$ [15], could also
be used. We choose Kendall's $\tau$ since it provides an intuitive
interpretation and also facilitates the derivation of the efficient
importance sampling method. Finally, experiments on real graph datasets
with both synthetic and real events demonstrated that the proposed TESC
testing framework was not only efficacious, but also efficient and
scalable.